Utilizing Protein Structure to Identify 
Non-Random Somatic Mutations 

Gregory Ryslik 1, Yuwei Cheng 2, Kei-Hoi Cheung 2,3, Yorgo Modis 4,
and Hongyu Zhao 1

1 Department of Biostatistics, Yale School of Public Health,
New Haven, CT, USA
2 Program of Computational Biology and Bioinformatics, Yale
University, New Haven, CT, USA
3 Yale Center for Medical Informatics, Yale School of Medicine,
New Haven, CT, USA
4 Department of Molecular Biophysics & Biochemistry, Yale
University, New Haven, CT, USA

February 28, 2013 

Abstract 

Motivation: Human cancer is caused by the accumulation of somatic 
mutations in tumor suppressors and oncogenes within the genome. In the 
case of oncogenes, recent theory suggests that there are only a few key 
"driver" mutations responsible for tumorigenesis. As there have been sig- 
nificant pharmacological successes in developing drugs that treat cancers 
that carry these driver mutations, several methods that rely on mutational 
clustering have been developed to identify them. However, these methods 
consider proteins as a single strand without taking their spatial structures 
into account. We propose a new methodology that incorporates protein 
tertiary structure in order to increase our power when identifying muta- 
tion clustering. 

Results: We have developed a novel algorithm, iPAC (identification of 
Protein Amino acid Clustering), for the identification of non-random so- 
matic mutations in proteins that takes into account the three dimensional 
protein structure. By using the tertiary information, we are able to de- 
tect both novel clusters in proteins that are known to exhibit mutation 
clustering as well as identify clusters in proteins without evidence of clus- 
tering based on existing methods. For example, by combining the data in 
the Protein Data Bank (PDB) and the Catalogue of Somatic Mutations 
in Cancer, our algorithm identifies new mutational clusters in well known 
cancer proteins such as KRAS and PIK3CA. Further, by utilizing the ter-
tiary structure, our algorithm also identifies clusters in EGFR, EIF2AK2,
and other proteins that are not identified by current methodology.

Availability: R package available on Bioconductor at 

http://www.bioconductor.org/packages/2.12/bioc/html/iPAC.html

Contacts: gregory.ryslik@yale.edu; hongyu.zhao@yale.edu 



1 Introduction 

Cancer is one of the most widespread and heterogeneous diseases, imposing
a huge toll on patients, relatives, friends, and society. However, at its most
basic, it is a genetic disease that is caused by the accumulation of somatic
mutations in oncogenes and tumor suppressors (Vogelstein and Kinzler, 2004).
While mutations in tumor suppressors tend to down-regulate the activity of
genes that prevent cancer, mutations in proto-oncogenes either up-regulate
or deregulate the activities of the resulting proteins. So far, pharmacological
intervention has been shown to be more successful at inhibiting activating
oncogenes than at restoring tumor suppressor gene function. Coupled with the
idea of "oncogene addiction", that many cancers rely on mutations in a small
subset of key genes to continue their uncontrolled growth while the remainder
of the mutations constitute passenger mutations (Greenman et al., 2007;
Weinstein and Joe, 2006), the problem of identifying activating oncogenic
mutations has received great attention in cancer research.

Recently, several studies have shown support for the hypothesis that
activating somatic mutations tend to cluster in protein kinases (Torkamani and
Schork, 2008; Greenman et al., 2007; Bardelli et al., 2003). Further, as
observed by Ye et al. (2010), mutational clusters might provide further
information regarding where to look for activating mutations, reducing the
driver mutation search space needed to be analyzed. Moreover, mutational
clusters that lead to either beneficial or detrimental phenotypic changes may
point to regions that are under positive or directional selection as well as
regions that are functionally significant and thus can be targeted by protein
engineering (Wagner, 2007).

So far, several methods based upon the number of mutations in a specific
region have been developed to detect potential driver oncogenic mutations as
well as naturally selected regions. One common method hypothesizes that driver
mutations have a higher non-synonymous mutation rate as compared to the
background mutation rate (Sjöblom et al., 2006; Bardelli et al., 2003). Further,
one can look at the ratio of non-synonymous ($K_a$) to synonymous ($K_s$)
changes per site, $K_a/K_s$ (Kreitman, 2000). A criterion for selection is then
to check if $K_a/K_s > 1$, based on the hypothesis that the benchmark neutral
rate of nucleotide substitution is exceeded when positive selection also
contributes to the substitution process. Similarly, Wang (2002) proposes a
hypothesis that driver mutations have a larger mutational rate than the
background mutational rate after gene length normalization.

While the approaches mentioned above have had some success in detecting
positive selection and/or identifying driver mutations, they nevertheless have
several shortcomings. First, many of them are dependent on calculating the
disparity in non-synonymous versus synonymous mutations but do not recognize
that selection often occurs on very small sections of the gene and thus might
fail when averaged over the entirety of the gene length. Second, the methods
described above (Wang, 2002; Kreitman, 2000) do not make any attempt to
distinguish between activating and non-activating non-synonymous mutations.

In addition to the approaches described above, some researchers have
focused on creating classifiers in order to determine mutation status. As
described in Reva et al. (2011), these algorithms employ a variety of machine
learning techniques, such as Random Forests (Breiman, 2001) and Support Vector
Machines (Cortes and Vapnik, 1995), to calculate a score for each mutation.
These scores are typically calculated using a combination of physico-chemical
properties such as evolutionary conservation, size and polarity of substituted
and original residues as well as surface accessibility. These scores are then
used to classify the mutation. For example, PolyPhen-2 (Adzhubei et al., 2010)
predicts whether a missense mutation is damaging while CHASM (Carter et al.,
2009) attempts to discriminate between driver and passenger mutations. While
several of these models have had significant success in classifying the
mutation, they all require large and well annotated data sets in order to first
train the machine learning classifier and then apply the resulting rule set.

Recently, Ye et al. (2010) developed Non-Random Mutational Clustering
(NMC) to identify potential activating mutations by hypothesizing that, in
the absence of heretofore known mutational hotspots, a mutational cluster is
indicative of selection for an activating driver mutation since only a small
number of precise mutations can activate a protein (Torkamani and Schork, 2008;
Bardelli et al., 2003). By looking at the order statistics and assuming that the
locations of amino acid mutations follow a uniform distribution when the protein
is considered in linear form under the null hypothesis, they identify clusters by
calculating whether any two pair-wise mutations are closer together on the line
than expected by chance alone. Despite its success, one limitation of the NMC
method is that the proteins are treated as a linear sequence without considering
the three dimensional structures of the proteins.

In this work, we extend the NMC methodology to account for tertiary
protein structure. This enables the identification of mutational clusters that
are relatively far away in linear space but relatively close together in 3D
space. We proceed to show that our methodology is effective in identifying novel
mutational clusters that are missed by NMC in key cancer proteins such as KRAS
and PIK3CA. Unlike NMC, iPAC is also able to identify the EGFR and EIF2AK2
proteins as containing mutational clustering. We also show that many of
the clusters identified by iPAC are predicted to be deleterious by well known
machine learning algorithms such as PolyPhen-2 (Adzhubei et al., 2010). However,
iPAC has the distinct advantage of requiring only the mutational positions and
tertiary structure, which allows its application to novel mutations and
structures for which extensive information and literature are not yet available.
Finally, we also show that for a large percentage of protein structures, the
tertiary structure leads to a net reduction in mutational clusters found, thus
presenting a simplified clustering mutational landscape. Ultimately, by
providing a refined picture of the mutational clustering, we are able to provide
a more accurate representation of where potential activating mutations may
reside within the protein.



2 Methods 

Our method, named iPAC, uses a four-step approach to finding mutational
clusters. First, mutational and positional data are obtained from the COSMIC
(Forbes et al., 2008) and PDB (Berman et al., 2000) databases (described in
Sections 2.1 and 2.2 respectively). The mutational and positional information
is then reconciled to allow a single numerical reference to identify the same
physical amino acid in both databases (Section 2.3). Next, Multidimensional
Scaling (MDS) (Borg and Groenen, 1997) is used to map the protein structure
from 3D to 1D space while preserving, as best as possible, all pairwise three
dimensional distances between amino acids for a given protein (Section 2.4). The
NMC algorithm is then run on the remapped amino acids to find mutational
clusters (Section 2.5). Finally, the clusters are mapped back into the original
protein space and reported back to the user. In the following subsections we
discuss each of these steps in detail.

2.1 Obtaining Mutational Data 

Mutational data were obtained from the COSMIC database (version 58) via 
ftp://ftp.sanger.ac.uk/pub/CGP/cosmic and implemented using Oracle. In 
order to justify the assumption that amino acids follow a uniform distribu- 
tion of mutation, only mutations that were found through whole gene screens 
were included. Further, we only used missense mutations that belonged to two 
categories: 1) "Confirmed somatic variant" or 2) "Reported in another cancer 
sample as somatic" . All nonsense and synonymous mutations as well as mu- 
tations that had different somatic status categories were excluded. Further, as 
multiple studies can report mutational data from the same cell line, mutational 
redundancies were removed to avoid double counting. See "COSMIC query" in 
the supplementary information for the SQL code and schema used to generate 
the data. Finally, in order to match mutational data with structural data, only
the proteins for which a UniProt Accession Number (Consortium, 2011) was
available were kept. This resulted in 777 unique proteins.
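
The filtering rules of this subsection can be illustrated with a short Python sketch. It is not the SQL used for the paper (see "COSMIC query" in the supplementary information). The record fields and the de-duplication key (gene, sample, residue) are hypothetical choices made only for this example and do not reflect the actual COSMIC schema.

```python
from typing import Iterable, List, NamedTuple


class Mutation(NamedTuple):
    # Hypothetical record layout for illustration; not the COSMIC schema.
    gene: str
    sample: str
    aa_pos: int
    mutation_type: str       # e.g. "missense", "nonsense", "synonymous"
    somatic_status: str
    whole_gene_screen: bool


# The two somatic status categories retained in Section 2.1.
ALLOWED_STATUS = {
    "Confirmed somatic variant",
    "Reported in another cancer sample as somatic",
}


def filter_mutations(records: Iterable[Mutation]) -> List[Mutation]:
    """Keep whole-gene-screen missense mutations with an allowed status,
    dropping repeats of the same (gene, sample, residue) triple."""
    seen = set()
    kept = []
    for rec in records:
        if not rec.whole_gene_screen:
            continue
        if rec.mutation_type != "missense":
            continue
        if rec.somatic_status not in ALLOWED_STATUS:
            continue
        key = (rec.gene, rec.sample, rec.aa_pos)  # assumed redundancy key
        if key in seen:
            continue
        seen.add(key)
        kept.append(rec)
    return kept
```

How redundancy between studies is best detected depends on the identifiers available in the database release; the triple used here is only one plausible choice.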

2.2 Obtaining the 3D Structural Data 



The protein structural data were obtained from the PDB database via
http://www.pdb.org. As one protein can have several structures, for each of the
777 proteins described above, all the structures with a matching UniProt
Accession Number were obtained. If a specific structure had more than one
polypeptide chain with a matching amino acid sequence in UniProt, the first
matching chain listed was used (typically chain A). For proteins where the
resolution was sufficiently high to provide more than one alternative
conformation for a specific amino acid side chain, only the first conformation
listed in the file was used. Once the appropriate side chain and conformation
were selected, the (x, y, z) coordinates of all the α-carbon atoms were
extracted and used to represent the 3D backbone structure of the protein. In
all, this process resulted in 1,904 structures. See "Structure Files" in the
supplementary information for a full listing of the structures and side chains
used for each protein considered.
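
As an illustration of the extraction step, the following Python sketch reads the α-carbon coordinates of one chain from the fixed-width ATOM records of a PDB file, keeping the first conformation listed for each residue. It is a simplified stand-in, not the code in the iPAC package: the function name is hypothetical, insertion codes are ignored, and for multi-model files only the first occurrence of each residue number is kept.

```python
from typing import Dict, Iterable, Tuple

Coord = Tuple[float, float, float]


def extract_ca_coords(pdb_lines: Iterable[str], chain_id: str = "A") -> Dict[int, Coord]:
    """Return {residue number: (x, y, z)} for the alpha-carbons of one chain."""
    coords: Dict[int, Coord] = {}
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue                      # skip HETATM, REMARK, etc.
        atom_name = line[12:16].strip()   # columns 13-16
        chain = line[21]                  # column 22
        if atom_name != "CA" or chain != chain_id:
            continue
        res_num = int(line[22:26])        # columns 23-26
        if res_num in coords:
            continue                      # keep only the first conformation listed
        x = float(line[30:38])            # columns 31-38
        y = float(line[38:46])            # columns 39-46
        z = float(line[46:54])            # columns 47-54
        coords[res_num] = (x, y, z)
    return coords
```

A typical call would pass an open file handle, for example `extract_ca_coords(open("3GFT.pdb"))`, where the file name is a local path assumed for the example.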



2.3 Reconciling the Structural and Mutational Data 

Due to a different numbering system of the amino acids employed by the PDB
and COSMIC databases, an alignment needed to be performed in order to
reference the same residue numerically in both databases. Two methods in the
iPAC package were designed to reconcile these differences, one based on pairwise
alignment (Pages et al., 2012) and the other based on a numerical reconstruction
from the structural data obtained from the PDB. As there are often significant
technical difficulties for such a reconstruction, for the rest of this paper, un- 
less specifically noted, pairwise alignment was used to reconcile these elements. 
Please see the documentation in the iPAC package for a full description of these 
two methods. Successful alignment of mutational and positional data occurred 
on 140 proteins which corresponded to 1100 unique structure/side-chain combi- 
nations and 667 unique residue positions containing 1,434 total mutations. We 
note that for any given structure/side-chain combination, if there is no posi- 
tional data for a specific residue, the mutational data for that residue is not 
used. Please see "Structure Files" in the supplementary information for a full 
description. 
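
To make the reconciliation idea concrete, the sketch below maps residue positions in a canonical sequence onto indices in the sequence observed in a structure. It uses `difflib.SequenceMatcher` from the Python standard library as a simple stand-in for a true pairwise alignment; it is not the method implemented in the iPAC package, and the function name and toy sequences are assumptions made for the example.

```python
from difflib import SequenceMatcher
from typing import Dict


def map_positions(canonical_seq: str, structure_seq: str) -> Dict[int, int]:
    """Map 1-based positions in canonical_seq to 1-based positions in
    structure_seq for every residue that falls in a matching block."""
    matcher = SequenceMatcher(None, canonical_seq, structure_seq, autojunk=False)
    mapping: Dict[int, int] = {}
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            mapping[block.a + offset + 1] = block.b + offset + 1
    return mapping


# Toy example: the structure lacks the first two and last three residues.
canonical = "MTEYKLVVVGAGGVGKSALT"
observed = canonical[2:17]
position_map = map_positions(canonical, observed)
```

Residues absent from the mapping have no positional data, which corresponds to the case above in which the mutational data for that residue is not used.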



2.4 Multidimensional Scaling 

As the underlying clustering algorithm is dependent upon the construction of
order statistics, we used MDS (Borg and Groenen, 1997) to remap the amino acids
into one dimensional space while preserving (as best as possible) the pairwise
distances between them in 3D space. Specifically, given an n x n dissimilarity
matrix,

$$
\begin{pmatrix}
\delta_{1,1} & \delta_{1,2} & \cdots & \delta_{1,n} \\
\delta_{2,1} & \delta_{2,2} & \cdots & \delta_{2,n} \\
\vdots       & \vdots       & \ddots & \vdots       \\
\delta_{n,1} & \delta_{n,2} & \cdots & \delta_{n,n}
\end{pmatrix},
$$

the MDS algorithm maps each $\delta_{i,j}$ into a corresponding distance
$d_{i,j}(X)$ on a new m-dimensional metric space X. Formally, for a specific
representation function, $f : \delta_{i,j} \to d_{i,j}(X)$, we have that the
original dissimilarities are preserved in X, specifically,
$f(\delta_{i,j}) = d_{i,j}(X)$. Here, f can be either fully defined or chosen
from a specified class of functions and is employed to handle the case when
the proximity measures come from a space that is not necessarily a true metric
space. Further, as it is not always possible to preserve the exact distance (for
example, due to sampling effects, measurement precision or loss of
dimensionality), rather than insist on $f(\delta_{i,j}) = d_{i,j}(X)$, the MDS
framework is typically set up such that $f(\delta_{i,j}) \approx d_{i,j}(X)$.
Thus, by minimizing a badness-of-fit measure called raw stress,
$\sigma_r = \sum_{i,j} [f(\delta_{i,j}) - d_{i,j}(X)]^2$, we identify the
$x_1, \ldots, x_n$ that preserve our distances in the new metric space X.
However, raw stress by itself is not always informative as it is subject to
distortion by the choice of units used. For instance, if the scale used to
measure changes by a factor of 100, the raw stress will change as well but by a
factor of $100^2$. Thus, Stress-1, which is defined as:

$$
\sigma_1 = \sqrt{\frac{\sum_{i,j} [f(\delta_{i,j}) - d_{i,j}(X)]^2}{\sum_{i,j} d_{i,j}^2(X)}} \qquad (1)
$$

and is not subject to unit distortion, will be minimized instead.

For the purposes of this paper, the dissimilarity matrix is simply equal to the
pair-wise distance between any two amino acids in the protein. Specifically, the
distance between residues i and j, denoted $\delta_{i,j}$, is taken to be the
Euclidean distance between their respective α-carbon atoms. As Euclidean space
is a proper metric space, from now on we assume that f is the identity function.
Further, as we require units along the line in order to calculate order
statistics, the MDS algorithm will be applied such that we find
$x_1, \ldots, x_n \in \mathbb{R}^1$. Thus, the MDS algorithm finds scalars
$x_1, \ldots, x_n$ such that $|x_i - x_j| \approx \delta_{i,j}$ for any two
pairwise amino acids i and j in the protein. We present an example when MDS
is applied to the 3GFT structure of KRAS (Tong et al., 2009) in Figures 1 and
2 below.
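
The 3D-to-1D remapping can be sketched in pure Python. Note the substitution: the sketch uses classical (Torgerson) MDS, which has a closed-form solution through the leading eigenvector of the double-centred squared-distance matrix, rather than an iterative minimization of Stress-1. It is meant only to illustrate the idea and is not a statement of how the iPAC package computes the mapping. All function names are hypothetical, and the power iteration assumes Euclidean input distances so that the leading eigenvalue is non-negative.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def pairwise_distances(coords: Sequence[Sequence[float]]) -> Matrix:
    """Euclidean distance between every pair of points (the delta matrix)."""
    return [[math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b))) for b in coords]
            for a in coords]


def classical_mds_1d(delta: Matrix, iterations: int = 200) -> List[float]:
    """One-dimensional classical MDS coordinates x_1, ..., x_n."""
    n = len(delta)
    sq = [[d * d for d in row] for row in delta]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # Double-centred matrix B = -1/2 * J * D^2 * J (D^2 is symmetric).
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    # Power iteration for the leading eigenvector; arbitrary non-constant start.
    v = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return [0.0] * n              # degenerate input (no spread)
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
    scale = math.sqrt(max(eigenvalue, 0.0))
    return [scale * x for x in v]


def stress_1(delta: Matrix, x: Sequence[float]) -> float:
    """Stress-1 of a 1D configuration with f taken as the identity.
    Summing over i < j gives the same ratio as summing over all i, j."""
    num = den = 0.0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            d = abs(x[i] - x[j])
            num += (delta[i][j] - d) ** 2
            den += d * d
    return math.sqrt(num / den) if den > 0 else 0.0
```

For points that already lie on a line the 1D configuration reproduces the distances exactly and Stress-1 is zero; for a folded protein the residual stress quantifies how much distance information is lost in the projection.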

2.5 NMC 



We employed the NMC algorithm (Ye et al., 2010) to find the mutational
clusters in one dimensional space. Specifically, consider a protein with N amino
acids and that each amino acid has a uniform probability of 1/N of mutation.
Given m samples and n mutations, we are able to calculate the order statistics
for every mutation (see Figure 3). Two mutations $X_{(i)}$ and $X_{(k)}$ are
then defined to be clustered if $\Pr(C_{ki} = X_{(k)} - X_{(i)}) \le \alpha$.
This probability is then calculated for every pair of mutations and adjusted for
multiple comparisons using either the Benjamini-Hochberg (BH) adjustment
(Benjamini and Hochberg, 1995) or the Bonferroni adjustment. For the analyses
performed in this paper, the more conservative Bonferroni adjustment was used.
Finally, it is important to note that the structural information obtained for
each protein often does not include positional information on every amino acid
within the protein. We removed these "missing" amino acids from the protein
before running the NMC clustering algorithm so that we can compare iPAC and NMC
on an equal basis. Ye et al. (2010) derive closed form solutions to calculate
$\Pr(C_{ki} = c)$ for $c \in \{0, 1, \ldots, N - 1\}$. However, as this becomes
computationally inefficient, they suggest dividing $C_{ki}$ by N and assuming a
continuous uniform distribution on (0, 1). They then show that in the limit, the
CDF becomes as follows:

$$
\Pr\left(\frac{C_{ki}}{N} \le c\right)
  = \Pr\left(\frac{X_{(k)}}{N} - \frac{X_{(i)}}{N} \le c\right)
  = \int_0^c \frac{n!}{(k-i-1)!\,(i+n-k)!}\, y^{k-i-1} (1-y)^{i+n-k}\, dy
  = \Pr(\mathrm{Beta}(k-i,\ i+n-k+1) \le c) \qquad (2)
$$

Thus, via Equation (2), we can directly calculate if two mutations are closer
together than by chance quickly and efficiently.

Figure 1: KRAS α-carbons in 3D Space (panel title: KRAS Alpha Carbon Positions).

Figure 2: KRAS α-carbons mapped to the x-axis using MDS (panel title: MDS
Mapping).

Figure 3: An example of constructing the order statistics. Suppose we had 3
samples of a protein that is N amino acids long. If amino acid i has a "*" above
it, that indicates that the amino acid for that sample had a non-synonymous
missense mutation. The samples are then collapsed together and the number of
mutations for each residue is shown above the box on the right. These counts
form the order statistics. The first mutation is on residue 2 ($X_{(1)} = 2$),
the next 3 mutations are on residue 3 ($X_{(2)} = X_{(3)} = X_{(4)} = 3$), the
next mutation is on residue 5 ($X_{(5)} = 5$) and the last 2 mutations are on
residue 6 ($X_{(6)} = X_{(7)} = 6$).
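
The pairwise test of this subsection can be sketched in Python. Because both Beta parameters in Equation (2) are positive integers, the Beta CDF can be evaluated exactly as a binomial tail sum, so no statistics library is needed. The function names are hypothetical and this is not the implementation in the NMC or iPAC software; in particular, two mutations on the same residue give c = 0 in this simplified version, and how such pairs are scored in practice is not covered here.

```python
from math import comb
from typing import Dict, List


def order_statistics(counts: Dict[int, int]) -> List[int]:
    """Expand {residue: number of mutations} into X_(1) <= ... <= X_(n)."""
    return [pos for pos in sorted(counts) for _ in range(counts[pos])]


def beta_cdf_int(a: int, b: int, c: float) -> float:
    """Pr(Beta(a, b) <= c) for positive integers a and b, using the identity
    I_c(a, b) = Pr(Binomial(a + b - 1, c) >= a). Adequate for modest n."""
    m = a + b - 1
    return sum(comb(m, j) * c ** j * (1.0 - c) ** (m - j) for j in range(a, m + 1))


def pair_p_value(x: List[int], i: int, k: int, n_residues: int) -> float:
    """Equation (2) for the 1-based order-statistic indices i < k."""
    n = len(x)
    c = (x[k - 1] - x[i - 1]) / n_residues
    return beta_cdf_int(k - i, i + n - k + 1, c)


# The order statistics of Figure 3: residues 2, 3, 3, 3, 5, 6, 6.
figure3 = order_statistics({2: 1, 3: 3, 5: 1, 6: 2})
```

Looping `pair_p_value` over all pairs i < k and applying a multiple comparison adjustment yields the candidate clusters.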



2.6 Multiple Comparison Adjustment For Structures 

In addition to the Bonferroni multiple comparison adjustment done by the NMC
method, an adjustment is also required to account for testing multiple structures
per protein. Since the structures for a given protein could be quite similar and
thus lead to similar clustering results, a second Bonferroni adjustment would
be too conservative. Instead, a combined Bonferroni-FDR approach was
performed as follows. First, for a given protein, the NMC reported p-value for a
given cluster was multiplied by $\frac{n(n-1)}{2}$ to calculate P*. Thus, on a
per-protein level, P* represents the inverse Bonferroni adjustment performed by
the NMC algorithm and thus allowed us to compare each cluster's P* to an α-level
of 0.05 to determine significance. To account for all the structures analyzed, we
computed a rough FDR (rFDR) (Gong et al., 2009):

$$
\mathrm{rFDR} = \alpha \cdot \frac{k + 1}{2k}
$$

where k is the total number of structures. In the case of the 1100 structures
analyzed in this study, rFDR ≈ 0.02502. Finally, any cluster for which
P* < 0.02502 was deemed to be significant. For the rest of this paper, with the
exception of Table 1, we only report the p-value to avoid confusion.
Nevertheless, each cluster presented in Section 3 is in fact significant after
adjusting for structural multiple comparisons.
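
The two adjustments can be checked with a few lines of Python. The sketch assumes that the multiplier used to obtain P* is the number of pairwise comparisons among n mutations, n(n-1)/2; the cap at 1 and the function names are additions made for the example.

```python
def p_star(p_value: float, n_mutations: int) -> float:
    """P* = p-value times the number of mutation pairs, capped at 1."""
    n_pairs = n_mutations * (n_mutations - 1) / 2
    return min(1.0, p_value * n_pairs)


def rough_fdr(alpha: float, n_structures: int) -> float:
    """rFDR = alpha * (k + 1) / (2k) for k structures."""
    return alpha * (n_structures + 1) / (2 * n_structures)


# Threshold for the 1100 structures analyzed: 0.05 * 1101 / 2200.
threshold = rough_fdr(0.05, 1100)


def is_significant(p_value: float, n_mutations: int) -> bool:
    """A cluster is kept when its P* falls below the rFDR threshold."""
    return p_star(p_value, n_mutations) < threshold
```

As k grows the threshold approaches α/2, so the structural adjustment is far milder than a second Bonferroni correction over 1100 structures would be.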



3 Results 

Using the iPAC package, 215 of the total 1100 structures analyzed were found 
to have significant clustering. When comparing iPAC with the original NMC 
method, out of the 140 proteins analyzed, both iPAC and NMC identified 8 
proteins that contained significant clusters. However, iPAC also identified 3 new 
proteins as well, specifically EGFR, EIF2AK2 and HAO1. These 3 new proteins
correspond to 10 of the 215 structures found to have clustering. iPAC also found 
structure 2ENQ for the protein PIK3CA to contain a significant cluster while 
NMC did not. The 8 proteins identified by both algorithms correspond to the 
remaining 204 structures. There were no proteins that were identified by NMC 
but were subsequently missed by the iPAC algorithm. Please see "Results 
Summary" in the supplementary materials for a full listing of which structures 
and which proteins were found to be significant. 

As can be seen from Figure 4, approximately 70% of all the structures found
to have significant clustering differed in the number of clusters identified
when comparing iPAC vs NMC. This suggests that in some cases, consideration of
the tertiary structure identifies additional clusters while in other cases,
clusters can be removed, offering a simplified view of the mutational
information. While it is outside the scope of this paper to consider every
one of the 215 structures with clustering, we present three representative cases
where integration of the tertiary protein structure into the analysis had a
significant effect: 1) identification of mutation clustering in a protein that
would otherwise be missed, 2) identification of new mutation clusters in a
protein that was detected using the NMC methodology, and 3) reduction of the
total mutational clusters in a protein that was detected using the NMC
methodology. We also note, as can be seen in Table 1, that the p-value found for
the most significant cluster is similar on the protein level. Proteins that had
very significant clustering, such as KRAS and TP53, remain very significant when
the tertiary structure is incorporated. Proteins that were less significant,
such as IDE and AKT1, remain so as well.

Figure 4: A comparison of NMC and iPAC over all the structures that were found
to be significant (chart title: iPAC (3D) vs NMC (Linear) Comparison).





                     iPAC                        NMC
Protein    P-value       P*            P-value       P*
-------    ----------    ----------    ----------    ----------
KRAS       6.17E-185     6.35E-181     4.39E-233     4.52E-229
TP53       5.23E-128     6.11E-123     4.37E-86      5.30E-81
BRAF       3.73E-130     1.01E-126     3.84E-130     1.04E-126
PIK3CA     8.20E-84      3.58E-80      8.20E-84      3.58E-80
NRAS       5.38E-26      6.46E-24      8.26E-29      9.91E-27
HRAS       1.23E-10      5.54E-09      5.61E-10      8.42E-09
AKT1       1.18E-05      7.08E-05      2.47E-05      7.41E-05
IDE        2.20E-05      6.60E-05      1.56E-03      4.67E-03

Table 1: A comparison of the most significant iPAC and NMC p-values from the 8
proteins that were picked up by both algorithms. P* is calculated as described
in Section 2.6.



We note that 9 out of the 11 proteins that were found significant by iPAC
had their most significant cluster overlap a binding site, proton acceptor site
or kinase domain. For the remaining 2 proteins, the most significant cluster
for PIK3CA overlapped amino acid 1047 which has been shown to ease the
entrance of substrates and hence potentially increase the substrate turnover
rate, a typical oncogenic behavior (Mankoo et al., 2009). For a detailed per
protein description, please see "Relevant Sites" in the supplementary materials.
Finally, we considered the performance of iPAC as compared to two popular
machine learning algorithms, PolyPhen-2 (Adzhubei et al., 2010) and CHASM
(Carter et al., 2009). First, a direct comparison must be considered in light of
the fact that these algorithms require a much more extensive set of information
than iPAC. Nevertheless, over 98% of the amino acids that occurred in
significant mutation clusters were also identified as significant (with an FDR
of < 20%) by PolyPhen-2 and CHASM. For full details, please see "Performance
Comparison" in the supplementary materials.



3.1 iPAC finds novel proteins 

As discussed in Section 3, three new proteins were identified by iPAC that were
missed when tertiary structures are not accounted for. The EGFR protein, a
cell-surface receptor for epidermal growth factor family ligands (Herbst, 2004),
is perhaps the most well known and has been found in a wide array of cancers
such as lung (Scagliotti et al., 2004), anal (Walker et al., 2009) and
glioblastoma multiforme (Heimberger et al., 2005). Although seven EGFR
structures were identified by iPAC to contain significant clustering, we will
concentrate on the 2GS7 structure (Zhang et al., 2006) as it showed the most
significant clustering. As seen in Table 2, three significant clusters were
found with cluster 3 being a sub-cluster of cluster 1. Figure 5 shows the
orientation of these clusters in three dimensional space.



Cluster    Start    End    Muts. in Cluster    P-Value
-------    -----    ---    ----------------    --------
1          751      858    4                   1.35E-04
2          719      751    2                   2.41E-03
3          790      858    2                   2.82E-03

Table 2: The three most significant clusters found in EGFR for the 2GS7 structure.



Overall, all the statistically significant clusters found deal with lung cancer
pathology and an increase in kinase activity. The two mutations in cluster 2,
G719S and T751I, are both found in lung cancer, with the first mutation
responsible for strongly increased kinase activity (Yun et al., 2007; Tam et
al., 2006; Paez et al., 2004) and the second found in erlotinib responsive non
small cell lung cancer (NSCLC) patients (Peraldo-Neia et al., 2011; Tsao et al.,
2005), respectively. Cluster 3 contains two mutations, T790M and L858R, both of
which have been found in lung cancer and are known for increased kinase activity
as well (Yun et al., 2008, 2007; Tam et al., 2006; Paez et al., 2004). Finally,
cluster 1 is comprised of clusters 2 and 3, with an additional mutation S768I
which potentially shows a positive clinical response to gefitinib in NSCLC
patients (Masago et al., 2010). It is interesting to note that both clusters 1
and 2, which are identified via statistical analysis, contain mutations that
have been found to benefit from pharmacological intervention. Had the tertiary
structure of EGFR not been taken into account, these clusters would not have
been identified by the NMC algorithm. When the protein is viewed linearly, the
mutations occur too far away from each other to result in statistically
significant p-values.

Figure 5: The 2GS7 structure color coded by region: 1) cluster 1 - orange, 2)
cluster 2 - blue and 3) cluster 3 - yellow. The boundary α-carbon amino acids of
719, 751, 768, 790 and 858 are shown as purple spheres.

3.2 iPAC finds additional clusters 

One example where iPAC finds additional clusters is in the KRAS protein when
analyzing the 3GFT structure¹ (Tong et al., 2009). KRAS, part of the RAS set
of proteins which are involved in a large number of signaling cascades, is one
of the most studied cancer oncogenes with activating mutations in approximately
17-25% of all human cancers (Kranenburg, 2005). While both NMC and iPAC
identified many of the same clusters such as amino acids 12-13, 12-61 and
12-146, iPAC identified several novel clusters as well, specifically amino acids
61-117 and 117-146. We note that both algorithms specifically identify a cluster
between residues 12 and 146, which, given that we only have positional data for
167 residues, signifies that there is one large cluster that covers ≈ 80% of all
the available amino acids. However, combined with the two novel clusters
identified by iPAC, we are able to partition the protein into three distinct
regions 1) 12-61, 2) 61-117 and 3) 117-146 that cover 30%, 34% and 18% of the
protein respectively (see Figure 6).



¹ For this analysis, we included mutational and positional data only on residues
1-167. No 3D positional information was available in the 3GFT structure on
residues 168-188, and these residues were removed before the analysis. Further,
the structural information has amino acid 61 as a histidine (isoform 2B for KRAS
in the UniProt database) while the COSMIC database has a glutamine in that
position. However, as the substitution of one amino acid in the structure for
another would not have a significant effect on its spatial orientation and as
amino acid 61 has a large number of somatic mutations, it was kept in the
analysis.

Figure 6: The 3GFT structure color coded by region: amino acids 13-60 are blue,
62-116 are red and 118-145 are yellow. The boundary α-carbon amino acids of 12,
61, 117 and 146 are shown as purple spheres.



We also ran NMC and iPAC on each region separately to consider how the
clustering results would be affected. As can be seen from Table 3, failure to
account for the tertiary protein structure resulted in region 3 no longer being
detected and region 1 losing significance by over ninety orders of magnitude.





                       P-value
Region           NMC         iPAC
-------------    --------    ---------
1) 12-61         1.37E-11    3.36E-105
2) 61-117        -           -
3) 117-146       -           3.35E-12
2&3) 61-146      -           3.31E-05

Table 3: P-value for each region when the region is considered independently.
A "-" signifies that the region was not found to be significant.



Further, while somatic mutations in region 12-61 have been found in many
cancers such as colorectal, lung, pancreatic and bladder (Sjöblom et al., 2006;
Tam et al., 2006; Lee et al., 1995; Motojima et al., 1993; Nakano et al., 1984;
Santos et al., 1984), somatic mutations at amino acids 61, 117 and 146 have
primarily been found in lung and colorectal carcinomas. Even more specifically,
mutations at amino acids 117 and 146 (K → N and A → T, respectively)
deal mostly with colorectal cancer (Sjöblom et al., 2006). Thus, by taking into
account the tertiary structure, the clusters identified by iPAC subdivide the
protein along pathological lines.



3.3 iPAC finds fewer clusters than NMC 

Of the 215 structures found to contain significant clustering, 86 structures were
identified where iPAC found fewer clusters than NMC. Three of these structures
correspond to BRAF, 31 correspond to HRAS and 52 correspond to TP53.
Here, we consider structure 3TV4 (Wenglowsky et al., 2011) for the BRAF protein
as it contains the most significant cluster found by both iPAC and NMC.
For this protein, it is well known that amino acid 600 is one of the most highly
mutated residues. In our dataset, 60 of the 76 total mutations that fulfilled the
requirements described in Section 2.1 occurred on amino acid 600. As expected,
the most significant "cluster" is located solely on that amino acid, with an iPAC
p-value of $3.73 \times 10^{-130}$ and an NMC p-value of $3.84 \times 10^{-130}$.
However, in total, iPAC identifies 9 clusters for this structure while NMC
identifies 19, with the differences shown in Table 4.




Figure 7: The 3TV4 structure color coded by region: 1) amino acids 464-600 are blue;
2) amino acids 601-671 are orange. The α-carbons of the mutated amino acids 464, 466,
469, 581, 596, 597, 601 and 671 are shown as purple spheres. Amino acid 600 is colored red.

                                    P-value
Start    End    # Muts.    iPAC          NMC
600      600    60         3.73E-130     3.84E-130
469      600    70         9.76E-122     5.63E-16
600      601    62         3.10E-79      1.35E-117
597      600    62         4.05E-77      2.20E-105
464      600    71         1.25E-73      1.74E-16
596      600    64         3.06E-73      8.28E-103
581      600    66         1.99E-51      2.96E-64
600      671    63         7.78E-15      3.54E-28
469      469    4          7.50E-04      7.50E-04

(a) Clusters found by both NMC and iPAC

Start    End    # Muts.    NMC P-value
597      601    64         8.28E-103
596      601    66         9.97E-102
581      601    68         8.73E-67
596      671    67         1.10E-31
597      671    65         1.93E-29
581      671    69         2.22E-28
464      601    73         7.09E-19
469      601    72         3.58E-18
464      671    74         6.01E-09
469      671    73         2.38E-08

(b) Clusters dropped by iPAC

Table 4: The significant clusters found by both iPAC and NMC are shown in Table 4a.
The clusters that were not deemed significant by iPAC but were deemed significant by NMC
are shown in Table 4b.

While it is outside the scope of this paper to consider all the differences
between Tables 4a and 4b, we would like to point out that, contrary to iPAC,
the NMC algorithm reports the two longest clusters: 1) 464-671 (p-value =
6.01 × 10^-9) and 2) 469-671 (p-value = 2.38 × 10^-8). After alignment of the
structure as described in Section 2.2, we only have structural information on
amino acids 448-723. Thus, the largest cluster detected by NMC covers ≈ 75%
of all the amino acids that we are considering. However, by taking into account
the 3D structure of the protein, these ultra-long clusters are dropped and the
clusters where iPAC and NMC overlap show 2 distinct areas of the protein,
amino acids 464-600 and 600-671. As expected, as the majority of mutations
occur on amino acid 600, both NMC and iPAC declare that the "cluster" located
at amino acid 600 is highly significant.
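
The coverage figure of roughly 75% quoted for the longest NMC cluster follows directly from the residue ranges given in the text. The short sketch below reproduces that arithmetic; the helper name is hypothetical, and both ranges are treated as inclusive, which is an assumption about how the residues are counted.

```python
def span_length(start: int, end: int) -> int:
    """Number of residues in the inclusive range [start, end]."""
    return end - start + 1


# Longest cluster reported by NMC for structure 3TV4.
cluster_residues = span_length(464, 671)

# Residues with structural information after alignment.
resolved_residues = span_length(448, 723)

coverage = cluster_residues / resolved_residues
print(f"{cluster_residues} of {resolved_residues} residues ({coverage:.1%})")
```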

Further, as described below, by considering only the clusters found when
taking into account the 3D structure (see Figure 7), the results again tend to
fall along pathological function. After applying the methodology described
in Section 2.1, the mutations that were found to be in significant clusters
included G464V, G466V, G469V, G469A, N581S, G596R, L597V, L597R,
V600E, V600K, K601N and R671Q. As R671Q was found in only one sample
within the COSMIC database and does not have extensive literature, we will
not include it in further discussion. Taking into account the 3 most significant
clusters picked up by iPAC and NMC, we now consider the protein in 3 parts:
A) Residues 469-599, B) Residue 600 and C) Residue 601 (we have slightly
adjusted the clusters displayed in Table 4a to avoid overlap). The mutations
listed that fall within region A correspond primarily to lung and colorectal
cancer (Gandhi et al., 2009; Pratilas et al., 2008; Greenman et al., 2007;
Lee et al., 2003; Davies et al., 2002; Naoki et al., 2002). Region B, which
comprises only amino acid 600, contains by far the most common mutation within
BRAF. This mutation results in constitutive and elevated kinase activity and
has been found in a large range of cancers including colorectal carcinoma,
ovarian serous carcinoma, metastatic melanoma and pilocytic astrocytoma.
Further, supporting the hypothesis that somatic clusters might provide
pharmacological targets, it has already been shown that suppression of this
cluster in melanoma causes tumor growth arrest and helps promote apoptosis
(Andreu-Pérez et al., 2011; Greenman et al., 2007; Sjöblom et al., 2006;
Hingorani et al., 2003; Rajagopalan et al., 2002; Davies et al., 2002).
Finally, the K601N mutation in region C has been found in multiple myeloma
patients who also may benefit from BRAF inhibitors (Chapman et al., 2011).
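
The three-part split of BRAF described above can be expressed as a small lookup. The following is a minimal sketch, not part of iPAC: the function names are hypothetical, the region boundaries are the ones stated in the text, and the second substitution at residue 597 is written as L597R, which is an assumption about the intended label.

```python
import re
from typing import Optional

# Region boundaries as stated in the text (inclusive residue ranges).
REGIONS = {
    "A": range(469, 600),  # residues 469-599
    "B": range(600, 601),  # residue 600
    "C": range(601, 602),  # residue 601
}

MUTATIONS = [
    "G464V", "G466V", "G469V", "G469A", "N581S", "G596R",
    "L597V", "L597R", "V600E", "V600K", "K601N", "R671Q",
]


def residue_position(mutation: str) -> int:
    """Extract the residue number from a label such as 'V600E'."""
    match = re.fullmatch(r"[A-Z]+(\d+)[A-Z]+", mutation)
    if match is None:
        raise ValueError(f"unrecognised mutation label: {mutation!r}")
    return int(match.group(1))


def assign_region(mutation: str) -> Optional[str]:
    """Return 'A', 'B' or 'C', or None if the residue lies outside all regions."""
    position = residue_position(mutation)
    for name, residues in REGIONS.items():
        if position in residues:
            return name
    return None


for label in MUTATIONS:
    print(label, assign_region(label))
```

Under these boundaries G464V, G466V and R671Q are not assigned to any region, since region A starts at residue 469 and region C ends at residue 601.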



4 Conclusion 

In this paper, we extended the existing methodology available to find somatic
mutation clustering by utilizing the information provided in the protein tertiary
structure. In doing so, we showed that we are able to find both new proteins
with clustering as well as new clusters in previously found proteins. We have
also shown that by taking into account 3D structure, we are able to remove
clusters that do not have biological meaning. The method is fast and robust,
with the vast majority of proteins analyzed within 5-10 minutes when executed
on a desktop with 8 GB of DDR3 RAM and an Intel i7 3600k processor running
at a frequency of 3.40 GHz. Further, as the underlying calculation relies upon
the NMC algorithm, a preset fixed window size is not required, which allows for
the detection of clusters of various lengths (Ye et al., 2010). We have also shown
that by employing a completely statistical methodology, we are able to identify
mutations that, when suppressed via pharmacological intervention, may
stop further tumor growth.

This methodology, while an improvement on the NMC method, still suffers
from some limitations. First, the mutation status of all the amino acids must
be determined, although with the advent of high-throughput sequencing this
will become less of an issue as time progresses. Also, both hypermutability of
genomic locations and unequal rates of mutagenesis might violate the assumption
that each amino acid has a uniform mutation probability. For instance, it
is well known that hypermutable positions for both somatic and germline mutations
exist. Insertions and deletions that are typically sequence dependent have
been removed from the analysis and only missense substitutions of single amino
acids have been kept in this study to help reduce such uniformity violations.
Similarly, CpG dinucleotides can have a mutational frequency that is ten times
or more that of other dinucleotides (Sved and Bird, 1990). However, less than
13% of the mutations used to find clustering in Sections 3.1, 3.2 and 3.3 were
in CpG sites. Further, as described by Ye et al. (2010), tobacco smoking
preferentially causes transversions in lung cancer while the mutational landscape
for colorectal cancer has more transitions (Hollstein et al., 1991). Nevertheless,
in the context of KRAS, the vast majority of mutations occur on amino acids
12, 13 and 61 for both lung and colorectal cancer. This suggests that while the
mutational spectrum may be different, it does not have a large effect on the
position of mutations and thus the uniformity assumption. As with previous
studies, while this analysis is influenced by nonrandom factors, it nonetheless
appears that selection of a cancer phenotype is the primary cause of clustering.
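
To illustrate why the uniformity assumption matters, the sketch below computes the probability that two order statistics of uniformly distributed positions fall within a given distance of each other. It is a simplified, continuous-case illustration and not the NMC or iPAC implementation: the function name and example numbers are hypothetical, and no discreteness correction or multiple-comparison adjustment is applied. It relies on the standard result that, for n independent Uniform(0, 1) draws, the spacing U_(k) - U_(i) follows a Beta(k - i, n - (k - i) + 1) distribution, whose CDF at x equals the probability that a Binomial(n, x) variable is at least k - i.

```python
from math import comb


def span_cdf(n: int, i: int, k: int, x: float) -> float:
    """P(U_(k) - U_(i) <= x) for the order statistics of n iid Uniform(0, 1) draws.

    The spacing follows Beta(k - i, n - (k - i) + 1); for integer parameters its
    CDF equals P(Binomial(n, x) >= k - i), which is summed directly here.
    """
    if not 1 <= i < k <= n:
        raise ValueError("require 1 <= i < k <= n")
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must lie in [0, 1]")
    r = k - i
    return sum(comb(n, j) * x**j * (1.0 - x) ** (n - j) for j in range(r, n + 1))


# Hypothetical example: 10 mutations, with the 1st through 8th ordered
# positions all falling within 5% of the protein length.
print(span_cdf(n=10, i=1, k=8, x=0.05))
```

A small value indicates that such a tight grouping would be unlikely if every position were equally likely to mutate, which is exactly the assumption that hypermutable sites could undermine.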

It should also be noted that while iPAC is designed to take tertiary structure
into account, it is only able to do so by appealing to the MDS methodology.
Future research is required in order to relax this restriction to potentially identify
additional clustering results. Finally, as shown in Section 3, iPAC finds fewer
clusters for a significant percentage of the structures analyzed. This reduction
in total clusters can come from two sources: the removal of some amino acids
due to lack of tertiary position information, or the cluster no longer being found
to be significant when 3D structure is taken into account. The first source, while
already rare, will become even more so in the future as more detailed structural
information becomes available. As for the second source, when a cluster is
not identified under iPAC when compared to NMC, an overlapping or nearby
cluster is typically found (as shown in Tables 4a and 4b). For BRAF specifically,
there was a total of 3 structures where iPAC found fewer clusters than NMC.
Further, every "possibly" or "probably damaging" mutation, as categorized by
PolyPhen-2 (Adzhubei et al., 2010), was still represented in at least one cluster
in each structure. Thus, in the case of BRAF, none of the damaging mutations
identified by PolyPhen-2 were lost. For a more detailed analysis, please see
"Potential Driver Loss" in the supplementary materials. Ultimately, further
research is required to further reduce the possibility of losing driver mutations
while taking into account tertiary structure.
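
For readers unfamiliar with MDS, the sketch below shows classical (metric) multidimensional scaling reducing a set of 3D points to one coordinate per point. It is only an illustration of the general technique under the assumption that MDS is used to collapse residue coordinates onto a single axis; it is not the iPAC code, the function name is hypothetical, and a plain power iteration is used in place of a full eigendecomposition.

```python
import math
from typing import List, Sequence


def classical_mds_1d(points: Sequence[Sequence[float]], iterations: int = 200) -> List[float]:
    """Embed points on a line using classical MDS (leading eigenvector only)."""
    n = len(points)
    # Squared Euclidean distance between every pair of points.
    d2 = [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in points] for p in points]
    row_mean = [sum(row) / n for row in d2]
    grand_mean = sum(row_mean) / n
    # Double centring: B = -1/2 * J * D2 * J (D2 is symmetric, so the
    # column means equal the row means).
    b = [
        [-0.5 * (d2[i][j] - row_mean[i] - row_mean[j] + grand_mean) for j in range(n)]
        for i in range(n)
    ]
    # Power iteration for the leading eigenpair of B, from a fixed start vector.
    v = [float(i + 1) for i in range(n)]
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return [0.0] * n  # all points coincide
        v = [x / norm for x in w]
        eigenvalue = norm  # equals the leading eigenvalue once v has converged
    return [math.sqrt(eigenvalue) * x for x in v]


# Hypothetical coordinates: four points lying on a straight line in 3D.
example = [(t * 1.0, t * 2.0, t * 2.0) for t in (0, 1, 3, 7)]
print(classical_mds_1d(example))
```

When the input points are exactly collinear, as in the example, the 1D embedding preserves every pairwise distance; for a folded protein the embedding is necessarily an approximation, which is the restriction noted above.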

In conclusion, we present a novel approach to identifying mutation clus- 
tering while taking into account protein tertiary structure. We further show 
that by taking into account tertiary structure we are able to detect clusters 
that would otherwise be missed. Next, we demonstrate that for some of the 
clusters found, pharmacological intervention has already been successfully ap- 
plied, further confirming the hypothesis that mutational clustering might point 
to activating driver mutations. As additional protein structures continue to be
solved, iPAC will be able to rapidly perform a statistical analysis to identify
such potential mutations. Finally, as we gain a better understanding of the
tertiary structure of DNA, this method might also have applications to finding
mutational clustering at the DNA level.